Who's talking?
A network plumber with hardware and software clue. 


• Network architecture and engineering.
• BGP, Verilog, C.
• More packets, less seconds.
• System-level design.
• Ex-Cisco, ex-Equinix.
• Fragments of production systems.


The Context 


• Vector extensions have been present on x86 platforms in various forms for over a quarter of a century.
• They have accumulated a substantial number of misconceptions and myths.


A look from two sides: 


• How one could use it in domains other than High-Performance Computing.
• How one could use it without a resulting loss of performance.
Vectorization?

• The logic of a scalar operation (and eventually an algorithm) applied to multiple sets of data.
• Not a novel concept as such; it has been around for half a century.
• Predominantly a domain of HPC and specialized compute platforms.
• Performance characterization in terms of latency and throughput.
• SIMD as a specific instantiation of the vectorization approach on x86 platforms.


Practical and Everyday: applicable to a multitude of generic and often-encountered 
scenarios on contemporary commodity platforms. 


Why bother? 


The compiler is smart; one just needs to specify the right command-line option, no?


    for (p = 0; p < k; p++) {
        C[i][j] += A[i][p] * B[p][j];
    }


For vertical operations that is indeed not a complex task to do, and 
compilers perform just fine. 


Why bother? 


What if the level of triviality is reduced a little?

• Control flow dependencies.
• Different lane widths and parameters.
• Branching.
• Iterations of different length.

    uint32_t CityHash32(const char *s, size_t len) {
      if (len <= 24) {
        return len <= 12 ?
            (len <= 4 ? Hash32Len0to4(s, len) : Hash32Len5to12(s, len)) :
            ...
      }
      ...
      h ^= a0;
      ...
      h = h * 5 + 0xe6546b64;
      h ^= a2;
      ...
      do {
        ...
        h = h * 5 + 0xe6546b64;
        ...
        f += a1;
        ...
        s += 20;
      } while (--iters != 0);
      ...
      h = h * 5 + 0xe6546b64;
      h = Rotate32(h, 17) * c1;
      h = Rotate32(h + f, 19);
      ...
      return h;
    }

Compiler output:

    missed: not vectorized: complicated access pattern.
    missed: not vectorized: no vectype for stmt: sum[0] = { 0, 0, 0, 0 };
    missed: couldn't vectorize loop


SHA2 example 


• A block-based cryptographic hash function, in widespread use.
• Initial message is split into fixed-length blocks.
• The same transformation function is applied to each block.
• Transient state is carried between adjacent blocks only.


There are more tiny details of SHA2 that are not 
relevant to our discussion. 


SHA2 example 


• Each block is processed independently; no internal state is leaked outside.
• Simple bitwise operations, many of them.
• Each operation has a fixed width of 32 bits.
• It is a reasonably widespread operation, and processor vendors have started providing specific instructions tailored to SHA2 processing.


Instruction set equivalence across different 
processor families may not be assumed. 


SHA2 example — some results 


• Scalar implementation taken as a reference.
• Out-of-order and pipelining aspects considered equally for all methods.
• Microbenchmarking may not provide reliable system-level results!

                 Latency    Throughput
    Scalar       1          [illegible]
    Accelerated  4          [illegible]
    Vectorized   0.85       16


SHA2 experiments 


• Load a full-width register and perform all the operations of the SHA2 round on it.
• Just like a wider scalar version.


SHA2 experiments FAIL 


• It does not work this way: most vector operations are vertical, not horizontal.
• Wide memory operations hide memory hierarchy latency and are strongly preferred overall.
• Efficiency of memory subsystem operations is a byproduct of vectorization.
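The difference between vertical and horizontal operations can be illustrated with a minimal sketch in portable C. It models a hypothetical 4-lane register as a plain array; `LANES`, `add_vertical` and `sum_horizontal` are illustrative names and not part of any library.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4 /* assumption: four 32-bit elements per register */

/* Vertical: lane i of the result depends only on lane i of each input.
 * This is the shape most vector instructions have. */
static void add_vertical(const uint32_t a[LANES], const uint32_t b[LANES],
                         uint32_t out[LANES]) {
    for (int i = 0; i < LANES; i++)
        out[i] = a[i] + b[i];
}

/* Horizontal: the result combines lanes of a single register.
 * On real hardware this usually needs extra shuffle or permute steps. */
static uint32_t sum_horizontal(const uint32_t a[LANES]) {
    uint32_t s = 0;
    for (int i = 0; i < LANES; i++)
        s += a[i];
    return s;
}
```

A single SHA2 round on one block mixes neighbouring words of the same block, which corresponds to the horizontal shape and explains why the "wider scalar" attempt fails.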


SHA2 experiments 


• Load multiple full registers.
• Transpose to vertical data layout.
• Perform a round on a set of registers.
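The transposition step could look roughly like the sketch below, which turns four rows of four 32-bit words into four columns. It assumes 128-bit registers and uses SSE2 unpack intrinsics when the compiler targets SSE2, with a plain C fallback otherwise; `transpose4x4_u32` is a hypothetical helper name, and wider registers would need a correspondingly larger transpose.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* in:  row-major 4x4 matrix, row r holds four consecutive words of input r.
 * out: row-major 4x4 matrix, row w holds word w of all four inputs. */
static void transpose4x4_u32(const uint32_t in[16], uint32_t out[16]) {
#if defined(__SSE2__)
    __m128i r0 = _mm_loadu_si128((const __m128i *)(in + 0));  /* a0 a1 a2 a3 */
    __m128i r1 = _mm_loadu_si128((const __m128i *)(in + 4));  /* b0 b1 b2 b3 */
    __m128i r2 = _mm_loadu_si128((const __m128i *)(in + 8));  /* c0 c1 c2 c3 */
    __m128i r3 = _mm_loadu_si128((const __m128i *)(in + 12)); /* d0 d1 d2 d3 */

    __m128i t0 = _mm_unpacklo_epi32(r0, r1); /* a0 b0 a1 b1 */
    __m128i t1 = _mm_unpackhi_epi32(r0, r1); /* a2 b2 a3 b3 */
    __m128i t2 = _mm_unpacklo_epi32(r2, r3); /* c0 d0 c1 d1 */
    __m128i t3 = _mm_unpackhi_epi32(r2, r3); /* c2 d2 c3 d3 */

    _mm_storeu_si128((__m128i *)(out + 0),  _mm_unpacklo_epi64(t0, t2)); /* a0 b0 c0 d0 */
    _mm_storeu_si128((__m128i *)(out + 4),  _mm_unpackhi_epi64(t0, t2)); /* a1 b1 c1 d1 */
    _mm_storeu_si128((__m128i *)(out + 8),  _mm_unpacklo_epi64(t1, t3)); /* a2 b2 c2 d2 */
    _mm_storeu_si128((__m128i *)(out + 12), _mm_unpackhi_epi64(t1, t3)); /* a3 b3 c3 d3 */
#else
    /* Portable fallback with the same result. */
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            out[c * 4 + r] = in[r * 4 + c];
#endif
}
```

After the transpose, each register holds the same word position from four different inputs, so the round logic can run vertically.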


SHA2 experiments FAIL 


• Blocks are bound together by an inter-block state transfer.
• Technically can be made to work, but at what cost?
• Resulting throughput -> 1, resulting latency -> +inf.


SHA2 experiments 


• Load multiple full registers with fragments of multiple messages.
• Transpose to vertical data layout.
• Perform a round on a set of registers.


SHA2 experiments 


• No cross-message block dependency.
• Throughput approaches the ratio of vector width to vector element size.
• Latency will be similar to the scalar option; there are many influencing factors.
• Latency might be better than the scalar option in some specific cases, with yet more influencing factors.
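A lane-per-message arrangement can be sketched as follows: each lane carries one word from a different message, and part of a SHA-256 round (the T1 term from FIPS 180-4) is evaluated for all lanes in lock-step. This is a portable C model under stated assumptions, not a production implementation: `LANES` is set to 4 (128-bit register, 32-bit words), and `vec_u32` and `t1_lanes` are hypothetical names. With 512-bit registers the same ratio would give 16 lanes.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4 /* assumption: 128-bit register / 32-bit words */

/* One 32-bit word from each of LANES independent messages. */
typedef struct { uint32_t v[LANES]; } vec_u32;

static uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32 - n)); /* valid for 0 < n < 32 */
}

/* Scalar SHA-256 helper functions as defined in FIPS 180-4. */
static uint32_t ch(uint32_t x, uint32_t y, uint32_t z) {
    return (x & y) ^ (~x & z);
}
static uint32_t big_sigma1(uint32_t x) {
    return rotr32(x, 6) ^ rotr32(x, 11) ^ rotr32(x, 25);
}

/* T1 = h + Sigma1(e) + Ch(e, f, g) + K + W, computed for every lane.
 * The round constant k is shared; all other operands are per message.
 * No lane reads another lane, so the loop is purely vertical. */
static vec_u32 t1_lanes(vec_u32 e, vec_u32 f, vec_u32 g, vec_u32 h,
                        uint32_t k, vec_u32 w) {
    vec_u32 r;
    for (int i = 0; i < LANES; i++)
        r.v[i] = h.v[i] + big_sigma1(e.v[i]) + ch(e.v[i], f.v[i], g.v[i])
               + k + w.v[i];
    return r;
}
```

Because the inter-block state stays inside a lane, the dependency that broke the previous attempt never crosses lanes here.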
SHA2 experiments: some extrapolation

• Can this be autovectorized?
• Memory and caching subsystem influence.
• Handling of non-regular cases, in both data flow and control flow.
• Specialty accelerators vs vectorization vs pipelined scalar execution.
Hash tables (maps)

• Two logical parts: calculate a hash value and place it into the corresponding storage.
• Memory intensive, but with different properties of intensity.
• Separate independent iterations.
• The hash function itself consists of arithmetic and bit operations.
• Logical and bit operations ideally fit the vector element width.
• Multiplication can overflow though!
• Data fetch via gather or linear load plus transposition.
• Scalar prologue and epilogue might result in better overall performance.
• Linear execution time across all lanes, throughput vs goodput.
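The points about independent iterations and goodput can be made concrete with a lane-wise hash over several keys at once. The sketch below swaps in 32-bit FNV-1a purely for illustration (it is not the hash used in the talk) and models four lanes in portable C; `fnv1a32_lanes` and `fnv1a32_scalar` are hypothetical names. Keys of different length are handled with a per-lane mask, so the batch runs for as long as its longest key.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4
#define FNV_OFFSET 2166136261u
#define FNV_PRIME  16777619u

/* Scalar reference: one key at a time. */
static uint32_t fnv1a32_scalar(const uint8_t *key, size_t len) {
    uint32_t h = FNV_OFFSET;
    for (size_t i = 0; i < len; i++)
        h = (h ^ key[i]) * FNV_PRIME; /* product wraps modulo 2^32 */
    return h;
}

/* Lane-wise version: LANES keys advance together, one byte per step. */
static void fnv1a32_lanes(const uint8_t *keys[LANES], const size_t lens[LANES],
                          uint32_t out[LANES]) {
    size_t max_len = 0;
    for (int i = 0; i < LANES; i++) {
        out[i] = FNV_OFFSET;
        if (lens[i] > max_len) max_len = lens[i];
    }
    for (size_t pos = 0; pos < max_len; pos++) {
        for (int i = 0; i < LANES; i++) {
            /* All-ones while the lane still has input, zero afterwards. */
            uint32_t active = (pos < lens[i]) ? 0xFFFFFFFFu : 0u;
            /* Models a masked gather: finished lanes read nothing. */
            uint8_t byte = active ? keys[i][pos] : 0;
            uint32_t next = (out[i] ^ byte) * FNV_PRIME;
            /* Blend: keep the old value in lanes that are already done. */
            out[i] = (next & active) | (out[i] & ~active);
        }
    }
}
```

Lanes with short keys sit idle until the longest key finishes, which is the gap between throughput and goodput. A hash that needs the full 64-bit product of two 32-bit values would require wider lanes, and that halves the number of keys per register.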
Tree-like Structures

• Tree traversal results in control flow branching. That needs to be translated into data flow masking.
• Initial pre-sorting based on search criteria parameters may result in an observable performance increase.
• Dealing with 64-bit pointers when only 32-bit elements are available results in a substantial slowdown.
• An unrolled scalar pass may result in comparable or better performance overall.
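Translating branching into masking might look like the following sketch, which searches a binary search tree for four keys in lock-step. It assumes a complete tree with unique keys stored in breadth-first order and addressed by 32-bit indices rather than 64-bit pointers; `tree_find_lanes` and `NOT_FOUND` are hypothetical names, and the lanes are modeled in portable C.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4
#define NOT_FOUND 0xFFFFFFFFu

/* tree:   complete BST in breadth-first order, children of node n
 *         are at 2n+1 and 2n+2; it must hold 2^depth - 1 nodes.
 * result: index of the matching node per lane, or NOT_FOUND. */
static void tree_find_lanes(const uint32_t *tree, unsigned depth,
                            const uint32_t keys[LANES],
                            uint32_t result[LANES]) {
    uint32_t idx[LANES] = {0, 0, 0, 0};
    for (int i = 0; i < LANES; i++)
        result[i] = NOT_FOUND;

    /* Every lane walks all levels; there is no early exit. */
    for (unsigned level = 0; level < depth; level++) {
        for (int i = 0; i < LANES; i++) {
            uint32_t node = tree[idx[i]];                   /* gather */
            uint32_t eq = 0u - (uint32_t)(node == keys[i]); /* all-ones on match */
            uint32_t gt = (uint32_t)(keys[i] > node);       /* 1 = go right */
            /* The "if found" branch becomes a masked blend. */
            result[i] = (idx[i] & eq) | (result[i] & ~eq);
            /* The "left or right" branch becomes index arithmetic. */
            idx[i] = 2 * idx[i] + 1 + gt;
        }
    }
}
```

Lanes that find their key early keep descending without effect, which is one reason an unrolled scalar pass can be competitive.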
Long Integer arithmetic

• Widely used in cryptography.
• A large integer is represented as a set of narrow limbs.
• Operations performed on limbs vertically.
• Specific algorithms generally stay the same as for a scalar approach.
• Cost of arithmetic operations generally outweighs memory access.
• “Free” constant-time execution for a group of lanes.
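As a rough illustration of vertical limb operations, the sketch below adds four pairs of multi-limb integers at once. The parameters are assumptions chosen for the example: 28-bit limbs stored in 32-bit lanes, so that a limb sum plus carry cannot overflow the lane, four limbs per integer, and a limb-major layout. `bigvec` and `bigvec_add` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4
#define NLIMBS 4
#define LIMB_BITS 28
#define LIMB_MASK ((1u << LIMB_BITS) - 1u)

/* limb[l][i] is limb l (least significant first) of the integer in lane i. */
typedef struct { uint32_t limb[NLIMBS][LANES]; } bigvec;

/* r = a + b in every lane; carry_out[i] receives the final carry of lane i. */
static void bigvec_add(const bigvec *a, const bigvec *b, bigvec *r,
                       uint32_t carry_out[LANES]) {
    uint32_t carry[LANES] = {0, 0, 0, 0};
    for (int l = 0; l < NLIMBS; l++) {
        for (int i = 0; i < LANES; i++) {
            /* Precondition: limbs are already reduced below 2^28. */
            assert(a->limb[l][i] <= LIMB_MASK && b->limb[l][i] <= LIMB_MASK);
            /* At most 2^29 - 1, so the 32-bit lane cannot overflow. */
            uint32_t t = a->limb[l][i] + b->limb[l][i] + carry[i];
            r->limb[l][i] = t & LIMB_MASK;
            carry[i] = t >> LIMB_BITS;
        }
    }
    for (int i = 0; i < LANES; i++)
        carry_out[i] = carry[i];
}
```

The arithmetic performs the same steps for every lane regardless of the values, with no data-dependent branches, which is the sense in which constant-time behaviour comes for free across a group of lanes.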




Combining it all together: BGPsec


Linear code block operating on different data sets in parallel

Hash multiple blocks in parallel
Sign/verify multiple hashes/signatures in parallel

Vector lanes of fixed width: multiplication restricts lane density

Gather operations place significant restrictions on data format

+20% latency results in +1200% throughput

Only if data structures allow!


Where's the complexity anyway?


• In order to vectorize scalar code successfully, flexible horizontal operations are mandatory.
• Control flow dependencies need to be translated to data flow dependencies.
• Data structure layout should not conflict with native vector width and length.

• Controllable/maskable access to memory.
• Assembling a vector register from scalar values (insert/extract).
• Mixing multiple vectors into one (blend).
• Swapping blocks and elements within a vector (permute, shuffle).
• Nonlinear and indexed memory access (scatter, gather).
• Conditional execution (predicates, masking).
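Turning a control flow dependency into a data flow dependency can be shown on a deliberately small case: a per-element maximum. The sketch below assumes four signed 32-bit lanes; it uses the SSE2 compare and bitwise intrinsics when available and an equivalent masked scalar loop otherwise. `max_branchy` and `max_masked4` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Control flow version: one branch per element. */
static void max_branchy(const int32_t *a, const int32_t *b, int32_t *out, int n) {
    for (int i = 0; i < n; i++) {
        if (a[i] > b[i]) out[i] = a[i];
        else             out[i] = b[i];
    }
}

/* Data flow version: the comparison produces a mask, the mask drives a blend. */
static void max_masked4(const int32_t a[4], const int32_t b[4], int32_t out[4]) {
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i m  = _mm_cmpgt_epi32(va, vb);            /* all-ones where a > b */
    __m128i r  = _mm_or_si128(_mm_and_si128(m, va),  /* take a where mask set */
                              _mm_andnot_si128(m, vb)); /* take b elsewhere */
    _mm_storeu_si128((__m128i *)out, r);
#else
    for (int i = 0; i < 4; i++) {
        uint32_t m = 0u - (uint32_t)(a[i] > b[i]);
        out[i] = (int32_t)(((uint32_t)a[i] & m) | ((uint32_t)b[i] & ~m));
    }
#endif
}
```

Both branches are evaluated for every lane and the mask selects the result, so the cost is the sum of both paths rather than the cost of the path taken.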
Assorted Aspects

• Atomicity of vector operations.
• Memory ordering.
• Domain-specific extensions (FMA52, GF2, AMX).
• Interworking with an OS, state management and context switching.
• Alignment, speculative execution and prefetching.
Controversial Topics

• Intrinsics in 2022?
• Is AVX-512 obsolete?
• What about alternative platforms? ARM, RISC-V?
• What about portability?
• What about compiler support?
• What's next beyond AVX-512?
• Variable vector length and width?
Discussion